Skip to content

[7/n] [sync-switch] add single-transaction switch config read (not yet wired up)#10648

Merged
sunshowers merged 5 commits into
mainfrom
sunshowers/spr/7n-sync-switch-add-single-transaction-switch-config-read-not-yet-wired-up
Jun 29, 2026
Merged

[7/n] [sync-switch] add single-transaction switch config read (not yet wired up)#10648
sunshowers merged 5 commits into
mainfrom
sunshowers/spr/7n-sync-switch-add-single-transaction-switch-config-read-not-yet-wired-up

Conversation

@sunshowers

Copy link
Copy Markdown
Contributor

Created using spr 1.3.6-beta.1

[skip ci]
Created using spr 1.3.6-beta.1
Comment on lines +577 to +579
// Every switch port. The bootstore config is rack-global,
// so this is intentionally not filtered by rack.
let ports = load_all_switch_ports(&conn).await?;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this comment is correct w.r.t. the multirack plans (cc @andrewjstone to confirm/deny). IIRC each rack will have its own bootstore, since it's tied to the bootstrap network / trust quorum for replication?

I'm also not sure this is particularly actionable given the stack of work and current shape of the bg task. Maybe just worth rewording this comment and/or filing an issue on the multirack board for figuring out what to do here moving forward?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, Andrew and I talked about it in person last week and I need to update this comment. (Yeah, each rack will have its own bootstore.)

@sunshowers sunshowers changed the base branch from sunshowers/spr/main.7n-sync-switch-add-single-transaction-switch-config-read-not-yet-wired-up to main June 29, 2026 18:28
Created using spr 1.3.6-beta.1
Comment thread nexus/db-queries/src/db/datastore/switch_port.rs Outdated
.await?;
set_entry.insert(announcements);
}
let announcements = announce_cache[&set_id].clone();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can't panic because we always insert something if this set_id was vacant, but it still makes me nervous; could we shuffle this around so we don't use the indexing operator without losing any clarity? Untested but something like

let announcements = match announce_cache.entry(set_id) {
    hash_map::Entry::Vacant(set_entry) => {
        let announcements =
            announce_dsl::bgp_announcement
                .filter(
                    announce_dsl::announce_set_id
                        .eq(set_id),
                )
                .select(BgpAnnouncement::as_select())
                .load_async::<BgpAnnouncement>(&conn)
                .await?;
        set_entry.insert(announcements).clone()
    }
    hash_map::Entry::Occupied(set_entry) => {
        set_entry.get().clone()
    }
};

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good idea.

Created using spr 1.3.6-beta.1
Created using spr 1.3.6-beta.1
@sunshowers sunshowers enabled auto-merge (squash) June 29, 2026 20:54
@sunshowers sunshowers merged commit 79bd0cf into main Jun 29, 2026
19 checks passed
@sunshowers sunshowers deleted the sunshowers/spr/7n-sync-switch-add-single-transaction-switch-config-read-not-yet-wired-up branch June 29, 2026 22:35
sunshowers added a commit that referenced this pull request Jun 30, 2026
Part two of the fix for #10640. This wires up what was introduced in the
previous commit, #10648, dropping the piecemeal reads. We aren't totally
out of the woods yet, because the transactional read currently doesn't
handle transient db issues as well as it could. We're going to fix that
in a principled fashion in upcoming commits.

Note that I haven't touched the mgd and other parts of the switch config
in this commit, since @jgallagher is actively working on them. (Let me
know if I should also include those changes either here or in a
followup.)

Depends on:

* #10642
* #10643
* #10644
* #10645
* #10646
* #10647
* #10648
sunshowers added a commit that referenced this pull request Jun 30, 2026
… silently dropping items (#10650)

Previously, on encountering a transient DB issue, we would silently drop
items from the bootstore config. That seems quite obviously bad and was
part of the reason #10640 escalated a crdb bug into persistent bootstore
corruption.

Rework all this by:

* collecting all the problems that occurred
* returning a new error type if there were any problems

**This is a functional change that is worth thinking very hard about.**
Many but not all of the invariants here are maintained by the database
layer. If there's persistent db corruption or a similar issue, we will
no longer generate a bootstore config for the rack! I think this is
overall the right thing to do, but it is quite possible there are latent
issues that were being hidden by the looser error handling that this
code previously had.

The report is now logged by the background task, unless the task exits
with one of the other early errors. (There's a much larger rework of the
task that needs to be done, but I'm not doing that here since it'll
collide with a lot of what @jgallagher is currently doing.)

Depends on:

* #10642
* #10643
* #10644
* #10645
* #10646
* #10647
* #10648
* #10649
sunshowers added a commit that referenced this pull request Jul 1, 2026
…rts newtype (#10651)

In prior commits in the stack, we made it so that transient DB errors
don't result in a partially-generated config. This change encodes one
specific form of that (non-empty ports) into the type system.

Making bootstore deserialization fallible is an improvement in this
case, because it turns a panic into a loop.

Also include a change to make the rack ID in `StartSledAgentRequest` a
typed `RackUuid`. This isn't directly related but is spiritually similar
and I'm touching this code anyway.

Depends on:

* #10642
* #10643
* #10644
* #10645
* #10646
* #10647
* #10648
* #10649
* #10650
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants